1. Summary


In this project we developed a prototype Shopforeman system that composes heterogeneous
computer vision algorithms on the fly, along with a complementary Nightwatchman system
that automatically generates novel concept classifiers for the Shopforeman to use. The
Shopforeman provides an interface that allows the user to specify the problem they are
trying to solve and a target operating point. The goal of this project is to explore the
design of an automated interface to a heterogeneous collection of computer vision algorithms
and complementary approaches to developing the concept classifiers and datasets that it uses.

We explore these issues broadly and present an approach that is capable of modeling
complex computer vision systems while operating on large data collections using classifiers,
image search indexes, human annotators, and heterogeneous computer vision algorithms.
Processing is performed using the Apache Hadoop cluster execution framework and data is
stored using Apache HBase. The role of this project is to explore an alternative approach
to that taken by the TA1 and TA2 Visual Media Reasoning (VMR) participants and identify
which components are most useful to the program overall. Our data storage (HBase),
computer vision system (Picarus), and classifiers were transitioned to the TA2 teams.
2. Introduction


Many natural distributions follow a power law and, when viewed with entities sorted by
frequency, form a long-tailed distribution. For our purposes, the most frequent entities
that occur in a database, such as people, faces, and vehicles, have substantial resources
dedicated to detecting them accurately; however, outside of the few most common entities
the rest are far less frequent and make up substantially more of the database’s diversity.
The challenge for our project is how to improve accuracy on the far end of this distribution,
where few samples exist for each entity and it becomes cost prohibitive to dedicate significant
manual resources to developing custom algorithms for them. Figure 1 illustrates this concept
graphically, where pictures of people occur frequently but pictures of a specific village may
occur only rarely. In this project we develop algorithms that operate abstractly, which allows
our approach to be adapted to a wide range of visual classification tasks. To do this we
automate standard experimental protocols to develop new concept classifiers and datasets,
using annotators to simplify the interface for the operator.

Figure 2 provides an overview of the system architecture. The Shopforeman and
Nightwatchman are the operator-facing applications and use the open source Picarus machine
learning system, which abstracts data storage, processing, web crawling, and annotation
tasks. The Shopforeman employs concept classifiers generated by the Nightwatchman. In
turn, the Nightwatchman uses the available storage, processing, crawling, and annotation
infrastructure to produce the concept classifiers.

Figure 1: Relative frequency of various concepts in a corpus, ranging from people, vehicles,
and weapons to specific landscapes and specific buildings.

Figure 2: Overall system architecture.

Figure 3: Example concept types considered by the system.

The Shopforeman uses a type system to compose computer vision algorithms that satisfy
the task input by the operator. For example, if we are given a series of photos and told to
recognize the people that appear in them, the system knows that 1.) each input must be a photo
and not an abstract graphic (e.g., a drawing in a document), 2.) it must have a face in it,
and 3.) after recognition we want a list of names of the people in the photo. This structure
is highly flexible and allows for adding new algorithms and types. It abstracts the entire
system into specifying input and output types, with the inference algorithm able to determine
what approaches are available to satisfy a task by performing a depth-first search.
The system is also able to refine the input type of an image, for example by determining if
it is indoors, a logo, or contains people. Figure 7 illustrates the different levels of concepts
that we aim to classify.
Figure 4: Shopforeman detected that the image was a photo taken outdoors in a natural
setting. Since it is a photo it attempted to extract faces from the image but none were found.

Figure 5: Shopforeman detected that the image was a photo taken outdoors in a natural
setting. Since it is a photo it extracted the face present.

Figure 6: Shopforeman detected that the image was not a photo (e.g., an image taken directly
from a camera) and proceeded to detect if the image is a logo. It found that it is most likely
an Apple logo using the Local Naive Bayes Nearest Neighbor algorithm.
3. Methods, Assumptions, and Procedures


3.1 Shopforeman

3.1.1  Types 

Each concept classifier has one input type and one output type. The type for newly input
images is “raw image”, and as concept classifiers positively identify additional attributes of
the image its type is further refined. The type system provides a straightforward way to
determine if a task can be accomplished and what methods are available to do so. Moreover,
the concept classifiers need only operate over the domain of their input type (e.g., face
recognition is only trained/evaluated on photos, not logos or screenshots), which improves
their discriminative power substantially compared to operating on all possible images. This
means that the type system implicitly induces a rejection-chain cascade, where for an image
to be evaluated by a classifier all preceding classifiers need to positively confirm their type.
This configuration is widely used and is particularly efficient for processing imagery, as fewer
classifiers are evaluated on each image.
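
As an illustration of how typed classifiers can be composed, the following is a minimal
sketch of a depth-first search over input/output types. The registry contents, the type
names other than “raw image”, and the function name `find_chain` are hypothetical and are
not taken from the Shopforeman implementation.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical registry: classifier name -> (input type, output type).
# These names are illustrative assumptions, not the system's actual classifiers.
CLASSIFIERS: Dict[str, Tuple[str, str]] = {
    "is_photo": ("raw image", "photo"),
    "is_graphic": ("raw image", "graphic"),
    "find_face": ("photo", "face"),
    "name_person": ("face", "person name"),
    "name_logo": ("graphic", "logo name"),
}


def find_chain(source: str, target: str,
               visited: Optional[Set[str]] = None) -> Optional[List[str]]:
    """Depth-first search for a chain of classifiers refining `source` into `target`."""
    if source == target:
        return []
    visited = (visited or set()) | {source}
    for name, (in_type, out_type) in CLASSIFIERS.items():
        # A classifier applies only if its input type matches the current type.
        if in_type != source or out_type in visited:
            continue
        rest = find_chain(out_type, target, visited)
        if rest is not None:
            return [name] + rest
    return None  # no composition of classifiers satisfies the task
```

Under these assumptions, `find_chain("raw image", "person name")` walks the photo branch,
and every classifier in the returned chain acts as a rejection stage for the next one.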

3.1.2  Datasets 

For machine learning tasks the data used for training and evaluation is often the key to
achieving acceptable accuracy. Problems encountered include domain mismatch (e.g., training
performed on a different domain than the one we evaluate on), training errors, intra-class
variation (e.g., vehicles include buses, motorcycles, and cars), between-class variation (e.g.,
bicycles and motorcycles are semantically different in day-to-day use but can look similar),
and a dearth of training samples. For our task the problem is more complex since long-tail
classes necessarily have few training samples and we are limited to crawling publicly
available image repositories and datasets. While existing computer vision datasets are a good
resource, they tend to cover only common classes and are often heavily biased in ways that
make extrapolating their performance to a real-world setting challenging. However, collecting
fully custom datasets for our task is impractical due to the wide scope and limited resources
available. For our tasks we use a set of resources that support crawling by query (e.g., Flickr,
Google Maps, and Google Images) along with commonly used vision datasets (e.g., SUN397,
ImageNet, and Labeled Faces in the Wild).

Figure 7: Directed Acyclic Graph (DAG) of a selection of the types used by the Shopforeman.

3.1.3  Components 

While the concept classifiers are trained for novel classes, their underlying algorithms are
drawn from a pool of generally applicable computer vision algorithms. For image
classification features we use Gist [10], bag-of-visual-words, and spatial pyramid histograms [6].
For image classification we use support vector machines with histogram intersection and
linear kernels. For image search tasks we use spherical hashing [4] to produce compact binary
codes from feature vectors. For search indexing we use multi-index hashing [9], which enables
sub-linear search by Hamming distance.
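
Two of the primitives named above can be sketched in a few lines: the histogram
intersection kernel and the Hamming distance between binary codes. This is a minimal
illustration under stated assumptions; the function names are hypothetical, binary codes are
assumed to be packed into Python integers, and the search shown is a plain linear scan
rather than multi-index hashing.

```python
from typing import List, Sequence


def histogram_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Histogram intersection kernel: sum of element-wise minima."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of differing bits between two binary codes packed as integers."""
    return bin(code_a ^ code_b).count("1")


def nearest_codes(query: int, database: Sequence[int], k: int) -> List[int]:
    """Indices of the k database codes closest to `query` (linear scan baseline)."""
    ranked = sorted(range(len(database)),
                    key=lambda i: hamming_distance(query, database[i]))
    return ranked[:k]
```

A linear scan costs one XOR and bit count per stored code; an index such as multi-index
hashing avoids visiting most of the database, which is what makes the search sub-linear.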

We developed a new classifier based on the Local Naive Bayes Nearest Neighbor [8] method
(a non-parametric, nearest-neighbor-based classifier) for multi-class recognition problems such
as landmark and logo recognition. This classifier suits this project well as it supports many
classes with relatively few examples, whereas Support Vector Machines have difficulty
supporting many classes and operating with few positive training examples.

Where:  Indoor/Outdoor  Scene  Classification 

As an illustrative example we show results on indoor/outdoor classification. The images
used are from the SUN397 dataset, with 19,850 images each for training and testing. A
linear SVM is used with a bag-of-visual-words representation of histogram of oriented
gradients (HOG) features.

Figure 8 shows curves detailing the Indoor/Outdoor classifier performance, with Indoor as
the positive class and Outdoor as the negative class. The top two curves (unnormalized and
normalized) are histograms of classification confidence; these show visually how separable the
two classes are given the classifier’s confidence value. The third curve shows accuracy,
(TP + TN) / (TP + TN + FP + FN), as a function of the confidence threshold. The SVM
was set to treat errors equally. The circles in all of the curves correspond to the same
threshold (thresh = 0), which is where the accuracy is maximized. The last two figures are
standard PR curves (TPR = Recall = TP / (TP + FN) and Precision = TP / (TP + FP)) and
ROC curves (TPR = Recall = TP / (TP + FN) and FPR = FP / (TN + FP)).

Figure 8: Indoor/Outdoor classifier performance on the SUN397 dataset.
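
The quantities plotted in these curves follow directly from the confusion counts at a given
threshold. The sketch below is illustrative only: the function names are hypothetical, labels
are assumed to be +1 (Indoor) and -1 (Outdoor), and a confidence at or above the threshold is
assumed to count as a positive prediction.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(confidences: Sequence[float], labels: Sequence[int],
                     threshold: float = 0.0) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) for labels in {+1, -1} at the given threshold."""
    tp = fp = tn = fn = 0
    for conf, label in zip(confidences, labels):
        predicted_positive = conf >= threshold  # assumption: ties count as positive
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall (TPR), and FPR as defined in the text."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # Convention (an assumption): precision is 1.0 when nothing is predicted positive.
        "precision": tp / (tp + fp) if tp + fp else 1.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "fpr": fp / (tn + fp) if tn + fp else 0.0,
    }
```

Sweeping `threshold` over the observed confidence values and plotting the resulting
(recall, precision) and (FPR, TPR) pairs yields the PR and ROC curves respectively.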

What/Where:  Semantic  Scene  Parsing 

For scenes that are primarily outdoors, knowing the basic structure of the scene (e.g., ground,
sky, vegetation, man-made structure, water, etc.) can help significantly in determining
high-level scene properties (e.g., is there a body of water, is there a building, etc.) and in
determining scene classification labels (e.g., jungle, city, desert, ocean). To perform this
task, we use a scene parsing algorithm based on Semantic Texton Forests [12] that
performs an efficient (sub-second) pixel-level classification.

What:  Logo  Recognition 

As an example of entity recognition we evaluated the Local Naive Bayes Nearest Neighbor
classifier on a dataset of the 100 top brands from Business Week (collected by querying Google
Images, with human verification) with 486 (80%) of the images used for training and 124
images (20%) used for evaluation. The image features used are multi-scale LAB color space
image blocks. Random chance on this problem is approximately 1% and this configuration
achieved 49% accuracy (i.e., the first ranked result is correct).

What: Landmark Recognition

Another example of entity recognition is recognizing one of four DC landmarks (Lincoln
Memorial, Thomas Jefferson Memorial, US Capitol, Washington Monument) from
Flickr images. The training set consisted of 232 (80%) images and the testing set had 54
(20%) images. Chance in this task is approximately 25% and we evaluated this approach
on a variety of image features: Dense HOG (16x16) 70%, Dense HOG (32x32) 79%, Dense
HOG (64x64) 79%, SURF 83%. In this task the images are unconstrained and have similar
landscapes as they are taken from the same region. No task-specific tuning was done, so
that we can evaluate the baseline classification performance which can be expected from the
concepts automatically generated by the Nightwatchman.

3.1.4  Picarus 

The underlying requirement of having algorithmic and human modules for the Shopforeman
to reason over necessitates an early focus on developing base implementations of these
modules. The goal is to produce a sufficient number of modules to demonstrate the
effectiveness of our Shopforeman algorithm. To facilitate this, we are using the open-source
Picarus library (Brandyn White is the primary developer) to produce classifiers and search
indexes that are the core of the underlying algorithmic modules and to facilitate interactive
human annotation for human input (both by users and by annotation workers). Picarus is a
web service that executes large-scale visual analysis jobs using Hadoop with data stored on
HBase. This approach simplifies integration with TA1/TA2 teams as they can be given
access to the web service directly, which has a REST (common HTTP data-access convention)
interface and web application.

The entire backend is available as open source, and we installed a Picarus cluster on
Amazon’s GovCloud for participating teams to use. We adapted the existing security,
administration, and data management to VMR requirements, installed Picarus (which requires
Hadoop, HBase, and Redis) on two GovCloud servers, and wrote documentation for Picarus
administration (http://goo.gl/fDWqhV). We interacted and consulted frequently with the
SET/Blackriver teams to transfer control of the server and ensure proper instructions were
available for expected tasks.

3.2  Cascade  Calibration 

We introduce a classifier and parameter selection algorithm for Classifier-as-a-Service
applications (i.e., the model of our Shopforeman and Nightwatchman) where there are many
components (e.g., features, kernels, classifiers) available to construct classification algorithms.
Queries specify varying requirements (i.e., quality and execution time), some of which may
require combining classification algorithms to satisfy. Each query may have a different set of
quality and execution time requirements (e.g., fast and precise, slow and thorough), and the
set of images to which the classifier is to be applied may be small (e.g., even a single image),
necessitating a query resolution method that takes negligible time in comparison. When
operating on large datasets, meeting design requirements automatically becomes essential to
reducing costs associated with unnecessary computation and expert assistance. As queries
specify requirements and not implementation details, additional components can be utilized
naturally. Our query resolution method combines classifiers with complementary operating
points (e.g., a high recall algorithmic filter, followed by high precision human verification) in
a rejection-chain configuration. Experiments are conducted on the SUN397 [14] dataset; we
achieve state-of-the-art classification results and 1 ms query resolution times.

This  cascade  calibration  approach  is  used  throughout  our  system.  The  Shopforeman 
employs  it  to  resolve  which  concept  classifier  to  use  for  a  given  task.  The  Nightwatchman  uses 
it  for  selecting  which  concept  classifiers  formed  in  a  cascade  are  most  effective  at  satisfying 
an  operator  provided  target  operating  criteria. 
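
To make the idea of resolving a query against precomputed operating points concrete, here is
a minimal sketch. The record layout, the field names, the function name `resolve_query`, and
the choice to return the fastest matches first are all assumptions made for illustration; the
report does not specify how its cascade database is stored or ranked.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class OperatingPoint:
    """One precomputed cascade configuration (hypothetical record layout)."""
    stages: Tuple[str, ...]
    thresholds: Tuple[float, ...]
    precision: float
    recall: float
    time_per_image: float  # seconds, assumed unit


def resolve_query(database: Sequence[OperatingPoint],
                  min_precision: float = 0.0,
                  min_recall: float = 0.0,
                  max_time: float = float("inf"),
                  limit: int = 100) -> List[OperatingPoint]:
    """Return up to `limit` operating points meeting all constraints, fastest first."""
    matches = [p for p in database
               if p.precision >= min_precision
               and p.recall >= min_recall
               and p.time_per_image <= max_time]
    matches.sort(key=lambda p: p.time_per_image)
    return matches[:limit]
```

Because the expensive work (training, prediction, and simulation) happens offline, a query
in this sketch reduces to filtering and sorting stored records, which is why resolution time
can be small relative to feature and classifier computation.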

3.2.1  Proposed  Approach 

We describe a method for efficiently determining a combination of classifiers and
thresholds sufficient to achieve a desired per-class quality and overall execution time
specification, provided as a “query”. Query resolution involves combining classifiers into a
rejection-chain cascade which satisfies the quality and execution time requirements as closely
as possible. Our primary contribution is the development of an efficient method to determine
the performance characteristics of these cascades offline, so that they can be queried
efficiently online. Given pre-trained cascade stages $\mathcal{S}$ (Sec. 3.2.2), we predict each
class $l \in \mathcal{L}$ using each cascade stage $S \in \mathcal{S}$, producing a confidence
matrix $x^S$ with $|\mathcal{L}|$ rows and $N$ columns, where $N$ is the size of a validation
set. We illustrate our approach on the scene recognition task, where each image belongs to
exactly one class and the validation ground truth $g$ is a matrix of the same shape as $x^S$
with values $g_{l,i} \in \{-1, 1\}$ for validation image $i$. For each cascade stage $S$ the
threshold selection (Sec. 3.2.3) produces a set of thresholds $T_l^S$ that capture that cascade
stage’s performance. The cascade selection produces $\mathcal{C}_l \subseteq \mathcal{P}(\mathcal{S})$
where $\mathcal{P}(\mathcal{S})$ is the powerset of the set of cascade stages $\mathcal{S}$.
Finally the cascade simulation (Sec. 3.2.4) simulates each cascade $C \in \mathcal{C}_l$ using
the thresholds $T_l^S$ selected where $S \in C$. The simulation is an efficient method of
evaluating the performance of the cascades and produces operating points (i.e., confusion
matrices and times) that are then stored in the cascade database. This training phase operates
on each class $l$ independently and, for notational convenience, we let $g = g_l$,
$T^S = T_l^S$, and $\mathcal{C} = \mathcal{C}_l$.

Figure 9: (left) Proposed cascade training, query, and execution approach. (right) Example
cascade stage instances.

3.2.2  Cascade  Stages 

A cascade stage $S$ takes an image $I$ and a set of classes $\mathcal{L}$ from which it
produces a (possibly sparse) confidence vector $x^S$. For the single class case in Fig. 9(a),
this is a single image feature $f$ and a single binary classifier $c_l$ which produces the
confidence vector $x^S$ as $x_l^S \leftarrow c_l(f(I))$. For the general form in Fig. 9(b), a
single feature $f$ is shared by many classifiers and $x^S$ is produced as
$\forall l \in \mathcal{L},\; x_l^S \leftarrow c_l(f(I))$. We now provide a definition for a
cascade stage $S$ where $I$ is an image and $\mathcal{L}$ is the set of classes (represented as
positive integers): (1) The cascade stage is defined for all $\mathcal{L}$,
(2) $x^S \leftarrow S(I, \mathcal{L})$ where $S$ may be non-deterministic and $x^S$ is a
real-valued vector, (3) $\forall l, m \in \mathcal{L}$, confidence value $x_l^S$ is independent
of $m \in \mathcal{L}$ or $m \notin \mathcal{L}$ where $l \neq m$, and (4) Larger values of
$x_l^S$ signify higher confidence that the input belongs to class $l$.

3.2.3  Threshold  Selection 

Given a trained cascade stage $S$ and a validation set, our task is to find a set of thresholds
that compactly represents its operating points. The threshold selection occurs independently
for each class and cascade stage $S$. We wish to minimize the number of thresholds $|T^S|$ to
reduce the cascade simulation complexity. We say a confidence value $x_i^S$ is positive with
respect to $t \in T^S$ if $x_i^S \geq t$. The initial set of thresholds is $X^S \cup \{\infty\}$,
where $X^S$ is the set of confidence values of $x^S$. Including infinity ensures that at least
one threshold has no false positives. This set is sufficient as any other threshold produces a
redundant partition; however, it is not necessary as thresholds which produce “worse”
confusion matrices may be present (i.e., more FP or FN errors). We construct a subset of this
initial set that is both necessary and sufficient. Given a confidence value $x_i^S$ and ground
truth label $g_i$ for each image $i$ in the validation set, we sort them ascending by confidence
value with positive ground truth instances listed before negative ones for the same confidence
values. Observe that the secondary sorting eliminates ‘overestimated’ confusion matrices that
can result from naive generation of confusion matrices with the same confidence value but
different ground truth polarities [1]. As this process is independent of the stage $S$, we let
$T = T^S$, $x = x^S$, and $X = X^S$. When operating on the vector $g$, $g_i$ represents the
ground truth polarity at position $i$, with $x_i$ as its associated confidence value and
$g_{i-1}$ as its neighbor in the descending direction.

Approved for Public Release; Distribution Unlimited.

We find the minimum number of thresholds $T_e$ required to exactly represent the
performance characteristics of the cascade stage. The predicate $\mathrm{Keep}(x_i)$ is true
when $g_{i-1}$ is not positive and $g_i$ is not negative; this clearly produces the desired set
as $T_e = \{x_i \in X : \mathrm{Keep}(x_i)\}$. More practically, we can relax our exact
representation by allowing for a bounded absolute precision/recall difference, $z$, between
the confusion matrices produced by $T_e$ and a subset $T_{a,z}$ of them, leading to $O(1)$
thresholds. This is accomplished by assigning each threshold in $T_e$ as a node in a graph
with edges induced by $z$ and computing the dominating set, which corresponds to the
thresholds in $T_{a,z}$.
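
The exact threshold selection can be sketched as a single pass over the sorted validation
predictions. This is a minimal illustration, not the report's implementation: the function
name is hypothetical, labels are assumed to be +1/-1, a confidence at or above a threshold is
assumed to be positive, and appending infinity to the result (so that a no-positives operating
point stays available) is an assumption. The relaxed dominating-set variant is not shown.

```python
import math
from typing import List, Sequence


def select_thresholds(confidences: Sequence[float],
                      labels: Sequence[int]) -> List[float]:
    """Keep a confidence value as a threshold only where a run of positives begins."""
    # Sort ascending by confidence; for ties, positives (+1) come before negatives (-1).
    ordered = sorted(zip(confidences, labels), key=lambda pair: (pair[0], -pair[1]))
    kept: List[float] = []
    prev_label = -1  # the position before the first element counts as "not positive"
    for conf, label in ordered:
        # Keep(x_i): the lower neighbor is not positive and this instance is not negative.
        if prev_label != 1 and label == 1:
            kept.append(conf)
        prev_label = label
    kept.append(math.inf)  # assumption: retain the threshold that accepts nothing
    return kept
```

For example, with confidences `[0.1, 0.35, 0.4, 0.8, 0.8, 0.9]` and labels
`[-1, 1, 1, 1, -1, 1]`, the thresholds 0.4 and 0.8 are dropped because 0.35 yields more true
positives with the same false positives, leaving 0.35, 0.9, and infinity.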

3.2.4  Cascade  Selection/Simulation 

We evaluate two methods for selecting which candidate cascades of length $\ell$ should be
used for simulation: dense ($D$) and sparse ($S'$). The sparse method selects the $\beta$
stages with highest rank correlation to the ground truth and limits the branching factor to
$\alpha$ stages with lowest minimum rank correlation with preceding stages. Given a set of
cascades $\mathcal{C}$ for a class with an associated set of thresholds $T^S$ where $S \in C$
and $C \in \mathcal{C}$, the goal is to efficiently compute a cascade database that contains
the union of all cascades in $\mathcal{C}$ that can be formed over all of their thresholds.
This process is a simulation as we are not computing the cascade performance directly; rather,
we find the confidence values for all stages independently and exactly compute their combined
performance offline. The output for each set of thresholds for a cascade includes a binary
confusion matrix, stage names, stage thresholds, and the percentage of inputs evaluated at
each stage (used to compute time).

A naive approach would evaluate the validation set using every possible chain and its
corresponding set of thresholds. There are a combinatorial number of operations in the length
of the cascade if every possible ordering of a single cascade (e.g., $A \to B \to C$,
$C \to B \to A$) were considered; however, the order does not affect the resulting binary
confusion matrix. We order the stages by running time per image. When evaluating a cascade
$A \to B \to C$, we have already computed $A$ and $A \to B$, which can be re-used for longer
cascades with common prefixes (e.g., $A \to B$ reused in $A \to B \to C$ and
$A \to B \to D$). Stages with the same parent are all computed simultaneously. The
computational complexity of this method is the sum of the complexity for each cascade from a
leaf to a root node. The complexity of each path is bounded by
$O(N \prod_{S \in C} |T^S|)$ where $N$ is the number of validation inputs (e.g., images for
image classification). However, this worst case performance cannot occur since the number
of validation inputs $N$ reduces from stage to stage, exactly as it does when passing through
a rejection-chain cascade.
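
The simulation of one cascade at one choice of thresholds can be sketched as follows. The
function name and argument layout are hypothetical; confidences are assumed to be
precomputed per stage for every validation input, and an input is assumed to pass a stage
when its confidence is at or above that stage's threshold. Prefix caching across cascades is
omitted for brevity.

```python
from typing import List, Sequence, Tuple


def simulate_cascade(stage_confidences: Sequence[Sequence[float]],
                     thresholds: Sequence[float],
                     labels: Sequence[int]
                     ) -> Tuple[Tuple[int, int, int, int], List[float]]:
    """Return ((TP, FP, TN, FN), fraction of inputs evaluated at each stage)."""
    n = len(labels)
    survivors = list(range(n))
    evaluated: List[float] = []
    for confs, threshold in zip(stage_confidences, thresholds):
        # Only inputs accepted by all earlier stages reach this stage.
        evaluated.append(len(survivors) / n if n else 0.0)
        survivors = [i for i in survivors if confs[i] >= threshold]
    tp = sum(1 for i in survivors if labels[i] == 1)
    fp = len(survivors) - tp
    fn = sum(1 for label in labels if label == 1) - tp
    tn = n - tp - fp - fn
    return (tp, fp, tn, fn), evaluated
```

The per-stage fractions are what allow an expected time per image to be estimated from the
stage running times, and the shrinking `survivors` list is the reason the worst-case bound
is not reached in practice.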
Figure 10: (left) Cascade recognition rates of features/kernels for our approach. (right)
Existing (#0-3), upper bound (#4-5), and (#6-15) method variations (threshold method, max
cascade length $\ell$, cascade selection method). The threshold selection methods (Sec. 3.2.3)
are $T_e$ and $T_{a,z}$. The cascade selection methods (Sec. 3.2.4) are $D$ and $S'$ with
$\beta$ as half of the stages.


3.2.5  Experiments 

We show experimental results on the SUN397 [14] scene recognition dataset, which consists
of 39,700 images (50 train/test per class, using partition #1). Fig. 10(a) summarizes the
features used along with their recognition rate for multi-class classification using linear and
histogram intersection kernels (HIK) over all of the features. LABhist uses the CIE L*a*b*
colorspace with 4 bins for L and 11 bins for a and b. TextonForest uses the method proposed
in [12] (trained on the MSRC dataset) to compute spatial pyramids on the maximum label
mask. The TinyImage [13] feature is a common baseline for scene recognition: the image is
resized to 32x32 in the CIE L*a*b* colorspace. ObjectBank (OB) [7] is a method that pools
object detector predictions to produce a highly discriminative feature. Spatial pyramids
(used by HOG2x2, LABhist, and Texton) are of scales [...]. Using a spatial pyramid
significantly improves performance when used with HIK at marginal additional computational
cost. Mechanical Turk indoor classification is included as a cascade stage with confidence
estimated from response time (Fig. 10(#13-15)); the best algorithmic result is in bold.

As we are evaluating binary classification performance, we use the mean area under
the ROC curve (AUC). Our method produces a significant improvement (compared to
Fig. 10(#0-3)) using similar features, classifiers, and kernels (Fig. 10(#7)). It is clear that
this gain is from the cascade design as the features alone (Fig. 10(#6)) perform worse. We use
20% of the training set as a validation set to learn the cascade database, with the classifiers
trained on the other 80%. We are able to calculate an upper bound for our method by
identifying the optimal cascades and thresholds on the test set (Fig. 10(#4-5)); the approach
has near ideal performance. To gain further insight we show the number of classes with AUC
better, same, or worse (>, =, <) than [14] for each result in Fig. 10 (our method uses 20% less
training data than [14] due to calibration).

To evaluate query performance, for all classes and for constraint sets of length 1 to 5 (i.e.,
precision, recall, time, accuracy, F-1) we compute 100 random example queries (e.g., House
with p > .4, r > .3, and t < 1) with a cap of 100 returned cascades using the cascade
database produced by Fig. 10(#14). The median query times (in ascending order from 1-5
constraints) are 1.1, 3.5, 1.0, .1, and .09 ms; the additional constraints reduce the
search space dramatically. This is fast compared to the feature/classifier computation and
is only performed once per query resolution, not per image as in [2]. Multi-class queries
(i.e., quality per class, overall time) produce per-class candidates and the overall time is
minimized. Computations were performed on a 2.2GHz Xeon.

3.3 Nightwatchman

The Nightwatchman produces concept classifiers for the Shopforeman to use. To do this
it orchestrates an iterative crawl, annotate, train, and evaluate loop. Figure 11 shows the
high-level steps that the Nightwatchman performs. The initial phase updates which slices
of data should be annotated for each of the training/validation sets. The updated slices
are annotated and classifiers are created from the annotated training samples. The trained
concept classifiers are then used to predict the class labels of the validation set, and from
these prediction values we are able to calibrate the classifiers using the previously described
approach, which tells us how effective they are and what thresholds are required
to achieve our target operating point.

3.3.1  Annotation 

Many of the high-level tasks (i.e., those that inform a Who/What/Where/When such as
landmark identification) require significant training data; few existing databases have
relevant imagery and fewer still have relevant annotation. Our approach has been to utilize
large imagery collections such as Flickr and Google Images to obtain an initial dataset and
then manually annotate the images using Mechanical Turk.

Besides simplifying the data collection process, by observing how the annotators interpret
their prompts and directions, we are able to better understand the task we are attempting to
automate. For example, when asked to annotate furniture on a diverse dataset, annotators
often disagreed on what that meant. Chairs, tables, and beds were obvious, but what
about pool tables, park benches, kitchen counters, and arcade machines? This is a sign that
furniture may not be a suitable class and perhaps finer grained classes (bed, chair, table,
etc.) would not only simplify annotation but also simplify interpretation of the classifier’s
results in real-world use.

Figure 12 shows the interface provided to annotators. It includes a high-level description
of the annotation process and a class-specific description that disambiguates any complex
cases that may cause annotators to disagree on what label an image should be given. This
interface is for binary class labels to allow for quickly annotating a large set of images,
requiring only a single button press per annotation, with new images preloaded so that the
next unlabeled image appears instantly. Figure 13 shows the resulting annotations for “outdoor”.

Figure 11: High level flow of the Nightwatchman’s iterative process.

Image Class Annotator

Instructions: Determine if the image has or is representative of the class specified below. A 'class' could be
in the image (e.g., object, scene) or a property of the image (e.g., image quality, image source).
Hotkeys are provided to allow you to select and submit results. If you are unsure then skip.

Class: outdoor
Description: Photo is taken outdoors (not inside a car or a building)

Image belongs to the class?
• True (a)
• False (b)
Skip

Figure 12: The interface provided to annotators for the label “outdoor”.

Positive Samples 824 / 1247

Figure 13: The resulting annotations from Mechanical Turk for the label “outdoor”.

3.3.2 Active Learning: Batch

In batch active learning mode we evaluate a corpus for the samples that give us the most
information if they are annotated. Three distinct sets of data are collected for annotation:
a uniformly random sample of validation images, a uniformly random sample of training images,
and a sample of strongly predicted positives/negatives. This strategy ensures that 1.) the
validation set is randomly sampled, which allows us to evaluate performance statistics
independent of the concept classifiers’ effectiveness, 2.) the training set is likely to have a balanced
set of positive/negative examples, and 3.) if the concept classifiers perform poorly the
uniform sampling will overcome any systemic biases in the long run. The features and classifiers
are those previously specified.
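The three-way sampling strategy for batch mode can be sketched as follows. This is an illustration under stated assumptions rather than the Nightwatchman's actual code: the function name, the dictionary of per-image scores, and the even split between confident positives and negatives are all hypothetical.

```python
import random


def select_batch(validation_ids, training_ids, scores,
                 n_val, n_train, n_confident, seed=0):
    """Pick the three annotation sets for one batch iteration.

    scores maps a training image id to the current classifier's score,
    where higher means "more likely positive" (an assumed convention).
    """
    rng = random.Random(seed)
    # 1) Uniform validation sample: keeps performance estimates unbiased.
    val_sample = rng.sample(validation_ids, min(n_val, len(validation_ids)))
    # 2) Uniform training sample: counters systematic classifier bias.
    train_sample = rng.sample(training_ids, min(n_train, len(training_ids)))
    # 3) Strongly predicted negatives and positives from the remaining pool.
    taken = set(train_sample)
    ranked = sorted((i for i in training_ids if i in scores and i not in taken),
                    key=lambda i: scores[i])
    half = min(n_confident // 2, len(ranked) // 2)
    negatives = ranked[:half]
    positives = ranked[len(ranked) - half:]
    return {"validation": val_sample, "training": train_sample,
            "confident_negatives": negatives, "confident_positives": positives}
```

Taking equal numbers from both ends of the score ranking is one simple way to make a balanced training set likely; other splits would serve the same purpose.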

3.3.3  Keyword  Selection 

When given a large set of input classes, not all of them will, in general, be effective candidates for
automatic discovery. For example, it is common for photo tags to have entirely non-visual
entries such as the name of the photographer. Ideally we could prune these to make more
efficient use of our available resources. Another failure mode is when the extracted image
features don’t capture the defining characteristics of the class, such as a picture
being “exciting”, which may have primarily visual characteristics but is unlikely to be
captured with low level image features. Semantic ambiguity (e.g., is a pool table a table?) can
cause annotators to disagree about what class an image is, and the resulting classifier will
be at least as unsure. The datasets being mined may not have any positive or even negative
instances of the class (e.g., if the dataset is all outdoors and the class is ‘outdoors’). As the
learning process is continuous and iterative, we will eventually reach the useful limit for a
given class and should then focus on other classes. These failure modes and others are avoided by
tuning the selection algorithm to focus on classes with 1.) high inter-annotator agreement,
2.) high annotator and classifier agreement, and 3.) a significant marginal gain in
accuracy between iterations.

Figure 14: High level overview of how data is stored in HBase for the Nightwatchman.
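The three selection criteria listed above can be made concrete with a short sketch. The agreement measures (mean pairwise agreement and agreement with the annotator majority), the function names, and the default cut-offs are assumptions chosen for illustration; the report does not specify how these quantities were computed.

```python
from itertools import combinations


def pairwise_agreement(votes_per_image):
    """Mean fraction of annotator pairs that agree, over images with >= 2 votes."""
    fractions = []
    for votes in votes_per_image:
        pairs = list(combinations(votes, 2))
        if pairs:
            fractions.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(fractions) / len(fractions) if fractions else 0.0


def classifier_agreement(votes_per_image, predictions):
    """Fraction of images where the classifier matches the annotator majority."""
    matches = 0
    counted = 0
    for votes, predicted in zip(votes_per_image, predictions):
        yes = sum(1 for v in votes if v)
        no = len(votes) - yes
        if yes == no:
            continue  # No majority label, so the image is not counted.
        counted += 1
        if (yes > no) == predicted:
            matches += 1
    return matches / counted if counted else 0.0


def keep_class(votes_per_image, predictions, accuracy_history,
               min_annotator=0.8, min_classifier=0.7, min_gain=0.01):
    """Apply the three criteria; the thresholds are hypothetical defaults."""
    if len(accuracy_history) >= 2:
        gain = accuracy_history[-1] - accuracy_history[-2]
    else:
        gain = float("inf")  # A new class always gets a first iteration.
    return (pairwise_agreement(votes_per_image) >= min_annotator
            and classifier_agreement(votes_per_image, predictions) >= min_classifier
            and gain >= min_gain)
```

A semantically ambiguous class such as "furniture" would be expected to score low on the first measure, while a class that has reached its useful limit would fail the marginal gain test.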

This is closely related to the classical multi-armed bandit problem, in which
there is a row of slot machines with unknown payout rates and you have a fixed number
of attempts to use them, with the goal of maximizing your total reward. This results in
a tradeoff between exploration (i.e., attempting to estimate the payout rate of as many
machines as we can) and exploitation (i.e., using the machine that has the highest known
payout rate). In our situation we can devise an objective function that models our overall
satisfaction with the results (e.g., the number of classes that meet specified target operating points).
This approach is able to dynamically add classes and datasets during operation.
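To illustrate the exploration/exploitation tradeoff, the sketch below uses the standard UCB1 rule, treating each class as an arm. UCB1 is a stand-in chosen for illustration and is not claimed to be the selection algorithm used by the Nightwatchman; the reward (for instance, the accuracy gain from one annotation round) is likewise an assumption.

```python
import math


class UCB1:
    """UCB1 bandit over named arms; arms may be added at any time."""

    def __init__(self):
        self.counts = {}  # Number of times each arm has been played.
        self.totals = {}  # Sum of rewards observed for each arm.

    def add_arm(self, name):
        self.counts.setdefault(name, 0)
        self.totals.setdefault(name, 0.0)

    def select(self):
        if not self.counts:
            raise ValueError("no arms have been added")
        # Explore: play every arm once before comparing estimates.
        for name, n in self.counts.items():
            if n == 0:
                return name
        t = sum(self.counts.values())

        def upper_bound(name):
            n = self.counts[name]
            mean = self.totals[name] / n
            return mean + math.sqrt(2.0 * math.log(t) / n)

        # Exploit, with a bonus for arms that have been played rarely.
        return max(self.counts, key=upper_bound)

    def update(self, name, reward):
        self.counts[name] += 1
        self.totals[name] += reward
```

Because untried arms are always played first, a class added during operation is scheduled for annotation on the next call to `select`, which mirrors the dynamic addition of classes described above.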

Figure 14 shows how data is annotated and evaluated by the Nightwatchman. Images
are stored in HBase, which allows them to be stored contiguously on disk for high throughput
processing. The initial dataset is partitioned into training and validation slices and we
annotate images in descending order from the beginning of the key space. A separate contiguous
training set that corresponds to hard samples mined from a large corpus is maintained to
further improve classification.


3.3.4  Example 

Next we provide an example of how classes are added to the Nightwatchman and the
annotation, classification, and evaluation loop is initiated. The bulk images are crawled from
Flickr and five popular tags are used: 2014, City, Forest, Ocean, and Sky.


    python nwm.py add_task bwhitenwm: bwhite@cs.umd.edu 5
    python nwm.py add_class zvjsox8ZTQu2tNUMJs-kkA sky "" --query sky
    python nwm.py add_class zvjsox8ZTQu2tNUMJs-kkA 2014 "" --query 2014
    python nwm.py add_class zvjsox8ZTQu2tNUMJs-kkA city "" --query city
    python nwm.py add_class zvjsox8ZTQu2tNUMJs-kkA ocean "" --query ocean
    python nwm.py add_class zvjsox8ZTQu2tNUMJs-kkA forest "" --query forest
    python nwm.py batch bwhite@cs.umd.edu zvjsox8ZTQu2tNUMJs

Figure 15: Figure demonstrating the Nightwatchman flow for one iteration with results (i.e.,
pos/neg/chance/accuracy) included for the previously issued command sequence.


3.3.5  Active  Learning:  Streaming 

The  previous  active  learning  approach  operates  on  a  large  corpus  of  existing  data  and  samples 
are  selected  for  annotation  in  batch;  however,  we  can  pose  the  problem  differently  and  instead 
of  deciding  what  samples  to  present  to  annotators  we  can  allow  them  to  explore  the  corpus 
at  will  and  determine  what  questions  to  ask  based  on  what  they  are  currently  viewing. 
This  approach  allows  us  to  continuously  collect  annotations  as  analysts  are  browsing  the 
dataset.  In  this  way  we  can  leverage  their  existing  workflow  and  they  are  likely  to  have  a 
deeper  understanding  about  the  imagery  they  choose  to  view  instead  of  uniform  selection. 
Moreover,  this  approach  reduces  their  cognitive  load  by  having  questions  available  while 
they  are  exploring  the  data  and  it  naturally  improves  performance  for  data  that  is  frequently 
accessed. 
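One way to decide which question to attach to the image an analyst is currently viewing is uncertainty sampling: ask about the class whose classifier score lies closest to its decision threshold. This heuristic is an assumption introduced here for illustration, as the report does not state how streaming questions are chosen, and the function and argument names are hypothetical.

```python
def next_question(image_scores, thresholds, already_asked=()):
    """Pick the class to ask about for the image currently being viewed.

    image_scores: dict mapping class name -> classifier score for this image.
    thresholds: dict mapping class name -> decision threshold for that class.
    already_asked: classes this image has already been annotated for.
    """
    candidates = [c for c in image_scores
                  if c in thresholds and c not in already_asked]
    if not candidates:
        return None
    # The smallest margin to the threshold marks the least certain prediction.
    return min(candidates, key=lambda c: abs(image_scores[c] - thresholds[c]))
```

Returning `None` when every class has been asked lets the viewer simply show no prompt for that image.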
4. Results and Discussion


4.1  Results 

The results for each component are provided in their individual sections and summarized
here. In Section 3.2.5 we show that by composing independently trained binary classifiers
in a classifier cascade we can achieve state-of-the-art performance on the scene recognition
task. Our algorithmic components, described in Section 3.1.3, demonstrate that using our
computer vision architecture (Picarus) we are able to build binary and multi-class classifiers
and search indexes without writing task-specific code. Our logo recognizer achieved 49%
accuracy (i.e., the first ranked result is correct), with random chance on this problem at
1%. Our landmark recognizer achieved 83% accuracy on photos taken ‘in the wild’ of DC
landmarks, with chance being 25%. Our indoor/outdoor classifier achieved an accuracy of
85% on the SUN397 dataset, with chance being 50%.

4.2  Discussion 

Throughout this project we learned several lessons that would likely be applicable to future
applications. This section summarizes the lessons that are more cross-cutting in nature
and weren’t explored in other sections.

When dealing with large-scale open-ended problems such as the one considered in this
work, it is important to impose constraints that prevent undue effort from being spent on
the most difficult classes. Often the relative effectiveness of a classifier can vary greatly
between classes and it is much more efficient to start from the easiest classes and move to
more difficult ones when their performance saturates.

Human annotation is expensive and it is worth optimizing the annotator’s workflow to
reduce cognitive load and judgement calls. Complex definitions make it substantially more
difficult to annotate, especially for short term annotators. It is much more effective to provide
an intuitive and judgement-free definition, which may require breaking one complex class into
two more clearly distinguished classes (e.g., vehicle vs 4-wheeled vehicle) or combining classes
(e.g., car vs Toyota).

Classifiers aren’t perfect and the simpler the domain they operate in the more effective
they will be; this is especially true when only a small training set is available. For
example, if every classifier has to properly handle non-photos (e.g., logos, heavily edited photos,
screenshots, documents) then it must necessarily be more complex; however, if instead we
restrict their domain to photos they will generally be more accurate. Moreover, by reducing
the domain size through the use of rejection chain cascades, as we do implicitly with our
Shopforeman’s type system, even weak classifiers can jointly form a strong classifier.
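The claim that weak classifiers can jointly form a strong one can be illustrated with a small worked example. The sketch assumes that the stages of a rejection chain make independent errors, which is a simplification, and the per-stage rates used in the note below are hypothetical numbers rather than measurements from this project.

```python
def cascade_accepts(stage_scores, thresholds):
    """A sample passes the cascade only if every stage accepts it."""
    return all(score >= threshold
               for score, threshold in zip(stage_scores, thresholds))


def cascade_rates(stages):
    """Combine per-stage (true_positive_rate, false_positive_rate) pairs.

    Assuming independent stage errors, a sample must pass every stage,
    so both rates multiply across the chain.
    """
    tpr = 1.0
    fpr = 1.0
    for stage_tpr, stage_fpr in stages:
        tpr *= stage_tpr
        fpr *= stage_fpr
    return tpr, fpr
```

For example, three weak stages that each pass 95% of true positives and 30% of negatives would jointly pass about 85.7% of positives but only 2.7% of negatives under the independence assumption.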

The intense research effort spent on individual common classes (e.g., person, car, face) is
intractable for the long-tail of visual concepts. Instead of focusing on a single class’s accuracy
it’s more effective to look at how many classes achieve an operationally useful classification
accuracy. By focusing on what our methods are able to classify we can then use statistical
models to infer attributes about more relevant but not as easily observed classes (e.g., using
scene classifiers to infer the types of objects that it may contain, such as a bathroom having
a sink).
5. Conclusion


In this project we developed a prototype Shopforeman system that composes heterogeneous
computer vision algorithms on the fly. The Shopforeman provides an interface to the user
that allows them to specify the specific problem they are trying to solve and a target
operating point. Additionally, we developed a complementary Nightwatchman system that
automatically generates novel concept classifiers for the Shopforeman to use by iteratively
mining available data resources (e.g., Flickr), selecting the most informative samples to
annotate, hiring remote annotators using Mechanical Turk, and then training classifiers based on the
annotation results.

Throughout our period of performance we worked with TA1 and TA2 participants to
integrate components of our approach including our classifiers, data storage system (Apache
HBase), and our in-house computer vision system (Picarus). We identified a need for a way
to automatically populate the VMR system with additional concept classifiers as manually
creating them is impractical for the “long tail” of concepts. Moreover, we demonstrated
a scalable approach to collecting a dataset for each concept while simultaneously learning
classifiers for it using only computational resources and untrained human annotators.